New space/time tradeoffs for top-k document retrieval on sequences

Authors

Gonzalo Navarro

Sharma V. Thankachan

Abstract

We address the problem of indexing a collection 𝒟 = {T_1, T_2, ..., T_D} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T_i), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg n), where CSA is a compressed suffix array indexing 𝒟, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
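To make the query semantics concrete, here is a minimal Python sketch of top-k retrieval by term frequency. It is a naive baseline and not the paper's method: it stores a plain (uncompressed) suffix array and a document array, and it counts every occurrence in the suffix array interval of P, so a query costs time proportional to the number of occurrences rather than to k. The function names (`build_index`, `sa_interval`, `top_k`), the `"\x00"` separator, and the tie-breaking rule are assumptions made for illustration only.

```python
import heapq
from collections import Counter


def build_index(docs):
    """Concatenate the documents and build a plain suffix array and document array."""
    text = ""
    owner = []  # owner[i] = index of the document that contains text position i
    for d, doc in enumerate(docs):
        text += doc + "\x00"  # assumption: "\x00" never occurs inside a document
        owner.extend([d] * (len(doc) + 1))
    # Naive construction: sort all suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, owner


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-p prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-p prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k(index, pattern, k):
    """Return up to k (document, term frequency) pairs, most frequent first."""
    text, sa, owner = index
    lo, hi = sa_interval(text, sa, pattern)
    tf = Counter(owner[sa[i]] for i in range(lo, hi))
    # Ties are broken in favor of the smaller document index (an arbitrary choice).
    return heapq.nlargest(k, tf.items(), key=lambda kv: (kv[1], -kv[0]))


index = build_index(["abracadabra", "banana", "cabana"])
print(top_k(index, "a", 2))
print(top_k(index, "ana", 5))
```

The compressed indexes described in the abstract aim to answer the same query without storing these arrays explicitly and without visiting every occurrence of P.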


Related papers

Improved Compressed Indexes for Full-Text Document Retrieval

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequenci...

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Colored Range Queries and Document Retrieval

Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries — colored range listing, colored range top-k queries and colored range counting — and, thus, new bounds fo...

Practical Top-K Document Retrieval in Reduced Space

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...

Journal:

Theor. Comput. Sci.

Volume 542

Publication date: 2014
